Search - add case insensitive flag for "term" family of queries #61596

markharwood · 2020-08-26T17:39:38Z

Adds case insensitive option to term, terms, terms_in_set, prefix, wildcard queries
First cut.
Closes #61546

elasticmachine · 2020-08-26T17:39:40Z

Pinging @elastic/es-search (:Search/Search)

server/src/main/java/org/elasticsearch/common/lucene/search/CaseInsensitiveAutomatonQuery.java

jimczi

I wonder if we should add new functions rather than parameters ?
Something like termQueryCI, prefixQueryCI and wildcardCI with a default implementation that throws a QueryShardException ?

markharwood · 2020-09-01T10:12:11Z

I wonder if we should add new functions rather than parameters ?

I was following the approach of using parameters that we introduced in regexpQuery.
We could have a version of termQuery etc without the param that delegates to the with-param version using the default of false. I originally made a hard break with the method signature to make sure the compiler pointed me at all uses of these functions. Having reviewed all the uses maybe we could look at a less invasive change (your termsQueryCI or my suggestion of a parameter-less default). Any preference before I start a refactor @jimczi?
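To make the two API shapes under discussion concrete, here is a minimal sketch. It assumes hypothetical stand-in classes (`SketchFieldType`, `SketchKeywordFieldType`, `SketchLongFieldType`) that return a description string instead of a Lucene `Query`, and it uses `UnsupportedOperationException` where the thread suggests a `QueryShardException`. It is not the code in this PR.

```java
import java.util.Locale;

// Hypothetical stand-ins for illustration only; not the real Elasticsearch classes.
abstract class SketchFieldType {
    private final String name;

    SketchFieldType(String name) {
        this.name = name;
    }

    String name() {
        return name;
    }

    // Existing case sensitive entry point, implemented by every field type.
    abstract String termQuery(Object value);

    // Separate case insensitive entry point. Field types that do not override
    // it inherit this default and reject the request (assumption: the real
    // default would throw a QueryShardException instead).
    String termQueryCaseInsensitive(Object value) {
        throw new UnsupportedOperationException(
            "[" + name + "] does not support case insensitive term queries");
    }
}

final class SketchKeywordFieldType extends SketchFieldType {
    SketchKeywordFieldType(String name) {
        super(name);
    }

    @Override
    String termQuery(Object value) {
        return name() + ":" + value;
    }

    @Override
    String termQueryCaseInsensitive(Object value) {
        // Lowercasing here only labels the sketch output; it is not how matching is done.
        return name() + ":" + value.toString().toLowerCase(Locale.ROOT) + " (case insensitive)";
    }
}

final class SketchLongFieldType extends SketchFieldType {
    SketchLongFieldType(String name) {
        super(name);
    }

    @Override
    String termQuery(Object value) {
        return name() + ":" + value;
    }
}

class ApiShapeDemo {
    public static void main(String[] args) {
        System.out.println(new SketchKeywordFieldType("user").termQueryCaseInsensitive("Kimchy"));
        try {
            new SketchLongFieldType("count").termQueryCaseInsensitive("1");
        } catch (UnsupportedOperationException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

With a separate method, only the field types that can honour the flag need to change, and existing `termQuery` call sites keep compiling.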

jimczi · 2020-09-04T12:44:12Z

I was following the approach of using parameters that we introduced in regexpQuery.

For regexQuery I think it's different since it's only relevant for text and keyword fields. Term queries are implemented by almost all field types, so I think we can at least have a differentiation there?

markharwood · 2020-09-07T13:00:00Z

I'm going to copy the regexp docs for this flag:

(Optional, boolean) allows case insensitive matching of the regular expression value with the indexed field values when set to true. Setting to false is disallowed.

I won't include the flag in the example query, though (even though doing so would improve test coverage). The reason is that I want to avoid promoting the use of this query-time flag if an index-time choice (e.g. using a normalizer) is typically the better option. I'm not sure whether the above is the best wording, or how to present the choice between index-time and query-time case sensitivity options. It seems a pain to dive into that detailed discussion for every type of query that supports the new case insensitive flag.

markharwood · 2020-09-07T14:54:49Z

I made the changes you suggested, @jimczi
As mentioned, I'd appreciate your eyes on this @nik9000 for the changes to scripted fields.

I feel uneasy about the query docs for the flag. It would be nice to point to somewhere central that outlines the trade-offs for case insensitivity at query time vs. index time.

nik9000

The runtime fields stuff seems right to me. I've added @javanna as a reviewer to make sure he gets a look at it too. We're trying to stay pretty synchronized on this code.

server/src/main/java/org/elasticsearch/common/regex/Regex.java

javanna

I have a naming concern: can we not use the CI acronym in the new method names? I find them hard to read, and they make it hard for readers to figure out what CI means in this context. I personally think that a slightly longer name is not a big problem.

server/src/main/java/org/elasticsearch/index/mapper/ConstantFieldType.java

markharwood · 2020-09-09T10:07:13Z

Thanks for the review, @javanna . I changed the name.

jimczi

I left some comments, thanks for adding the distinction for term queries

server/src/main/java/org/elasticsearch/common/lucene/search/CaseInsensitiveAutomatonQuery.java

server/src/main/java/org/elasticsearch/common/lucene/search/CaseInsensitivePrefixQuery.java

server/src/main/java/org/elasticsearch/common/lucene/search/CaseInsensitiveTermQuery.java

server/src/main/java/org/elasticsearch/common/lucene/search/CaseInsensitiveWildcardQuery.java

docs/reference/query-dsl/prefix-query.asciidoc

server/src/main/java/org/elasticsearch/index/mapper/IndexFieldMapper.java

...src/main/java/org/elasticsearch/xpack/runtimefields/mapper/ScriptBooleanMappedFieldType.java

...s/src/main/java/org/elasticsearch/xpack/runtimefields/query/StringScriptFieldTermsQuery.java

jimczi · 2020-09-09T11:37:14Z

server/src/main/java/org/elasticsearch/index/mapper/MappedFieldType.java

+
+
+    // Case insensitive form of terms query
+    public Query termsQueryCaseInsensitive(List<?> values, @Nullable QueryShardContext context) {        


I wonder what we should do here? We'd need to add support for case insensitive matching in TermInSetQuery if we want to avoid building a giant boolean query in the keyword field. For now I think it's ok to not provide this option on terms query since we cannot support them efficiently?

astefan · 2020-09-09

@jimczi terms query is currently used by EQL for more obvious queries like file where file_name in ("wininit.exe", "lsass.exe") and (less obvious) for the cidrMatch function for matching an IP address in a list of CIDR blocks. It would be useful to have case insensitive support for it as well.

nik9000 · 2020-09-09

(less obvious) for cidrMatch function for matching an IP address in a list of CIDR blocks.

Neat! I just want to make sure you aren't expanding the cidr blocks or anything - we natively support the cidr match in the term and terms query.

costin · 2020-09-09

I just want to make sure you aren't expanding the cidr blocks or anything - we natively support the cidr match in the term and terms query.

We're not, we just pass the block to ES.
When dealing with an expression that forces us to do scripting, we use the underlying Lucene class for matching.

nik9000 · 2020-09-09

Nice.

Hopefully you'll be able to use runtime fields before too long!

markharwood · 2020-09-10

For now I think it's ok to not provide this option on terms query since we cannot support them efficiently ?

If we don't support it, won't people just be forced to do what we do internally and create a bool should array of term queries?
Is there a query-complexity circuit-breaker we're somehow missing here?

astefan · 2020-09-17

Adding a +1 to have this supported in terms query. The alternative in EQL, in case we decide to not support this, would probably be (as @markharwood mentioned) to create a bool query with a bunch of term queries in it. CC @jimczi

jimczi · 2020-09-17

My worry is that we'd internally create an array of should term queries because TermInSetQuery doesn't handle case insensitive queries. It's maybe not a big issue, but it wouldn't be consistent: a case insensitive terms query would fail with more than 1024 terms (the max boolean clause limit) while the normal one would use the optimized TermInSetQuery. So my take on this is that we should implement proper support in TermInSetQuery if we really want to handle case insensitive terms queries. For EQL and SQL I don't think there's a real need to handle terms though. The in operator doesn't need to handle thousands of terms, so imo it would be acceptable to translate it into an array of term queries. Same for the cidrMatch function, which also works on term queries.

markharwood · 2020-09-18

OK - I'll pull the termsQueryCaseInsensitive method for now. We can always add it in another PR later. This PR probably has enough changes to consider already.
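The inconsistency raised in this thread can be illustrated with a small sketch. It assumes a hypothetical helper, `expandToShouldClauses`, that rewrites a case insensitive terms lookup as one should clause per value and enforces the 1024 clause limit cited above. Clauses are plain strings here, not Lucene queries, so this only models the limit and is not how Elasticsearch builds the query.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class TermsExpansionSketch {
    // The max boolean clause limit cited in the thread.
    static final int MAX_CLAUSE_COUNT = 1024;

    // Hypothetical helper: one "should" clause per value, rejected past the limit.
    static List<String> expandToShouldClauses(String field, List<String> values) {
        if (values.size() > MAX_CLAUSE_COUNT) {
            throw new IllegalArgumentException(
                "too many clauses: " + values.size() + " > " + MAX_CLAUSE_COUNT);
        }
        List<String> clauses = new ArrayList<>();
        for (String value : values) {
            clauses.add("term(" + field + ", " + value + ", case_insensitive=true)");
        }
        return clauses;
    }

    public static void main(String[] args) {
        // A short EQL-style "in" list stays well under the limit.
        System.out.println(expandToShouldClauses("file_name", List.of("wininit.exe", "lsass.exe")));

        // A large list fails here, whereas a case sensitive terms query
        // would not be expanded into boolean clauses at all.
        try {
            expandToShouldClauses("file_name", Collections.nCopies(1025, "x"));
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

This is why a short in list is acceptable as an array of term queries, while general case insensitive terms support was left for a follow-up.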

markharwood · 2020-09-16T11:01:08Z

I ran some benchmarks to see what effects this flag has on response times.
I used an index of 2m weblogs, indexing as wildcard and keyword and indexing the whole log record string (high cardinality) and just the user agent (low cardinality).
I then ran a variety of queries on these fields which helps show some of the performance characteristics:

Response Time (secs) for 1,000 queries on different fields in 2m doc index

Field type/query	Low cardinality (12k unique strings)	Hi cardinality (1.3m unique strings)
Keyword/ci term	8	57
Keyword/term	1	1
Keyword/ci wildcard	13	1098
Keyword/wildcard	11	1100
Wildcard/ci term	24	2.7
Wildcard/term	14	1.3
Wildcard/ci wildcard	56	129
Wildcard/wildcard	60	133

Some of the observations this helped me draw out:

A case insensitive (CI) term query is slower than a case sensitive one but nowhere near as expensive as any partial-value search on a high cardinality keyword field.
Keyword fields experience up to a 7x slowdown when adding CI but wildcard fields only have a ~2x slow down.
For CI or case sensitive term queries, the higher the cardinality, the faster the wildcard field searches (presumably fewer postings per unique term to be verified).

jimczi

The change looks good to me. Thanks for iterating on this. I'd prefer that we consider terms queries in a follow-up, since the current matching on keyword would not be able to leverage TermInSetQuery, but I don't have strong feelings either way. We can also optimize later, so I'll let you decide on that aspect ;).

jimczi · 2020-09-18T11:53:45Z

server/src/main/java/org/elasticsearch/common/regex/Regex.java

+        return simpleMatchWithNormalizedStrings(pattern, str);
+    }
+
+    private static boolean simpleMatchWithNormalizedStrings(String pattern, String str) {


simpleMatchCaseInsensitive?

markharwood · 2020-09-18

That name might suggest the case insensitivity is built into the method (it's not).
I wanted something less explicit ("post-normalization"?) that suggested the strings were assumed to have already been normalised (which for argument's sake could have been uppercasing not lowercasing).

jimczi · 2020-09-18

fair enough
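The naming point can be seen in a small sketch: the public entry point decides how to normalise, and the private helper only compares strings it assumes are already normalised. The glob matcher below supports only `*` and is an assumption written for illustration; it is not the body of the real `Regex.simpleMatch`, and the three-argument overload is hypothetical.

```java
import java.util.Locale;

class SimpleMatchSketch {
    // Hypothetical entry point: normalise both inputs here, then delegate.
    static boolean simpleMatch(String pattern, String str, boolean caseInsensitive) {
        if (caseInsensitive) {
            return simpleMatchWithNormalizedStrings(
                pattern.toLowerCase(Locale.ROOT), str.toLowerCase(Locale.ROOT));
        }
        return simpleMatchWithNormalizedStrings(pattern, str);
    }

    // Knows nothing about case: it compares the characters it is given.
    // '*' matches zero or more characters; every other character is literal.
    private static boolean simpleMatchWithNormalizedStrings(String pattern, String str) {
        int p = 0;
        int s = 0;
        int star = -1; // index of the last '*' seen in the pattern
        int mark = 0;  // position in str where that '*' started matching
        while (s < str.length()) {
            if (p < pattern.length() && pattern.charAt(p) == '*') {
                star = p++;
                mark = s;
            } else if (p < pattern.length() && pattern.charAt(p) == str.charAt(s)) {
                p++;
                s++;
            } else if (star >= 0) {
                // Backtrack: let the last '*' absorb one more character.
                p = star + 1;
                s = ++mark;
            } else {
                return false;
            }
        }
        while (p < pattern.length() && pattern.charAt(p) == '*') {
            p++;
        }
        return p == pattern.length();
    }

    public static void main(String[] args) {
        System.out.println(simpleMatch("Foo*", "FOOBAR", true));
        System.out.println(simpleMatch("Foo*", "FOOBAR", false));
    }
}
```

Because the helper is case-agnostic, the caller could just as well uppercase both inputs, which is the reason given above for not naming it simpleMatchCaseInsensitive.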

markharwood · 2020-09-18T15:14:07Z

test this please

andreidan · 2020-10-08T10:14:12Z

Removed backport pending as this has been backported via #62661

jypan0115 · 2023-04-04T04:06:09Z

In the toCaseInsensitiveChar function in server/src/main/java/org/elasticsearch/common/lucene/search/AutomatonQueries.java, case insensitivity currently only works for ASCII characters. May I know why non-ASCII characters, such as those used in Vietnamese, are not supported? This is not consistent with the keyword field.

cbuescher · 2023-04-17T10:49:51Z

@jypan0115 your reading of the code is correct. As the comment in the code points out, the "insensitive option" currently only works for ASCII characters. Would you mind opening an enhancement issue pointing out where this fails, e.g. for Vietnamese? I'm not sure how casing is handled in that language, but having an example will help us determine the implementation effort and prioritize this.

jypan0115 · 2023-04-18T10:09:08Z

@jypan0115 your reading of the code is correct. As the comment in the code points out, the "insensitive option" currently only works for ASCII characters. Would you mind opening an enhancement issue pointing out where this fails, e.g. for Vietnamese? I'm not sure how casing is handled in that language, but having an example will help us determine the implementation effort and prioritize this.

@cbuescher I opened an issue about this (https://github.com/elastic/elasticsearch/issues/95120). Here is a failing example. We have a field called name. If we store it as a keyword field and use a case insensitive term query to search for Ngô Đức (uppercase) or ngô đức (lowercase), it works fine. But if it is stored as a wildcard field, the case insensitive term query fails. This can be fixed by removing these 3 lines in the toCaseInsensitiveChar function in server/src/main/java/org/elasticsearch/common/lucene/search/AutomatonQueries.java:

if (codepoint > 128) {
    return case1;
}

And that's why I am asking why we only support ASCII characters in wildcard field. Hope this helps.
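The effect of that guard can be sketched without Lucene. The hypothetical helper below, `caseVariants`, returns the code points a single-character matcher would accept, with an `asciiOnly` switch standing in for the three lines quoted above. It is an illustration under those assumptions, not the real `toCaseInsensitiveChar`, which builds an automaton.

```java
import java.util.Arrays;

class CaseVariantsSketch {
    // Hypothetical helper: which code points should match the given one?
    static int[] caseVariants(int codepoint, boolean asciiOnly) {
        // Mirrors the quoted guard: above 128, only the exact character matches.
        if (asciiOnly && codepoint > 128) {
            return new int[] { codepoint };
        }
        int altCase = Character.isLowerCase(codepoint)
            ? Character.toUpperCase(codepoint)
            : Character.toLowerCase(codepoint);
        if (altCase == codepoint) {
            // No distinct other case (digits, punctuation, caseless scripts).
            return new int[] { codepoint };
        }
        return new int[] { codepoint, altCase };
    }

    public static void main(String[] args) {
        int upperD = 0x0110; // 'Đ', as in "Ngô Đức"
        System.out.println(Arrays.toString(caseVariants('a', true)));
        System.out.println(Arrays.toString(caseVariants(upperD, true)));
        System.out.println(Arrays.toString(caseVariants(upperD, false)));
    }
}
```

With the guard enabled, Đ (U+0110) only matches itself, so a query for ngô đức cannot match Ngô Đức. Without it, the simple one-to-one mapping from `java.lang.Character` also accepts đ (U+0111). Whether that mapping is sufficient for every script is a separate question that this sketch does not answer.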

cbuescher · 2023-04-18T10:20:24Z

Thanks, I'll copy parts of this short note over to the issue.

markharwood added WIP :Search/Search Search-related issues that do not fall into other categories labels Aug 26, 2020

markharwood self-assigned this Aug 26, 2020

elasticmachine added the Team:Search Meta label for search team label Aug 26, 2020

markharwood force-pushed the fix/61546 branch 3 times, most recently from 5b9a7bf to cc70b86 Compare August 27, 2020 18:00

vijaykriishna reviewed Aug 29, 2020

server/src/main/java/org/elasticsearch/common/lucene/search/CaseInsensitiveAutomatonQuery.java (outdated)

jimczi reviewed Aug 31, 2020

markharwood force-pushed the fix/61546 branch from cc70b86 to c3ea3ff Compare September 1, 2020 09:05

markharwood mentioned this pull request Sep 4, 2020

Fix invalid flag setting for RegExp #61976

Merged

markharwood force-pushed the fix/61546 branch 3 times, most recently from ab4bdfe to 053aee1 Compare September 7, 2020 12:45

markharwood force-pushed the fix/61546 branch from 828e907 to 0539abc Compare September 7, 2020 13:49

markharwood removed the WIP label Sep 7, 2020

markharwood requested a review from nik9000 September 7, 2020 14:52

nik9000 requested a review from javanna September 8, 2020 12:38

nik9000 reviewed Sep 8, 2020

server/src/main/java/org/elasticsearch/common/regex/Regex.java (outdated)

nik9000 reviewed Sep 8, 2020

javanna reviewed Sep 8, 2020

server/src/main/java/org/elasticsearch/index/mapper/ConstantFieldType.java (outdated)

markharwood force-pushed the fix/61546 branch from 29d895a to b82974a Compare September 9, 2020 09:30

jimczi requested changes Sep 9, 2020

markharwood force-pushed the fix/61546 branch from 0c80418 to 698b62a Compare September 15, 2020 08:33

jimczi approved these changes Sep 18, 2020

markharwood force-pushed the fix/61546 branch 2 times, most recently from de8c3d2 to 8f9ad73 Compare September 18, 2020 13:16

First cut at adding case insensitive flag for term, terms, termsInSet…

685a419

…, prefix, wildcard queries Closes elastic#61546

markharwood force-pushed the fix/61546 branch from 8f9ad73 to 685a419 Compare September 18, 2020 14:37

Formatting fix

83f5658

markharwood merged commit fe9145f into elastic:master Sep 18, 2020

markharwood added backport pending v7.10.0 v8.0.0 labels Sep 18, 2020

markharwood added a commit that referenced this pull request Sep 22, 2020

Search - add case insensitive flag for "term" family of queries #61596 (

a0df0fb

#62661) Backport of fe9145f Closes #61546

astefan mentioned this pull request Sep 23, 2020

QL: Improve disjunction translations on the same field #62804

Closed

andreidan added >enhancement and removed backport pending labels Oct 8, 2020

markharwood mentioned this pull request Oct 15, 2020

[KQL] Should wildcard queries default to case-insensitive search? elastic/kibana#80591

Closed

swallez mentioned this pull request Oct 19, 2020

Update semantics or definition of case_insensitive in term queries #63893

Closed

Mpdreamz mentioned this pull request Nov 16, 2020

7.10.1 Meta Ticket elastic/elasticsearch-net#5096

Closed

61 tasks

stevejgordon mentioned this pull request Dec 17, 2020

7.11.0 Meta Ticket elastic/elasticsearch-net#5198

Closed

markharwood mentioned this pull request Jan 11, 2021

Option for case insensitive search at runtime #61162

Closed

7 tasks

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021

cbuescher mentioned this pull request Apr 18, 2023

Support Case Insensitive search for foreign characters in wildcard field type #95120

Open
